Stochastic Gradient Descent (SGD)#

../../_images/sgd-meme.png

Stochastic Gradient Descent is an optimization algorithm that differs from ordinary Gradient Descent in that, at each step, the gradient of the optimized function is computed not as the sum of the gradients over all elements of the sample, but as the gradient of a single randomly selected element.

Gradient Descent (GD) review#

Before going into the study of stochastic gradient descent, let’s recall the formula of ordinary gradient descent. We have the following:

  • \(\boldsymbol w = {(w_1, w_2, ..., w_n)}^{T}\) — the vector of parameters, where \(n\) is the dimension of \(\boldsymbol w\)

  • \(\nabla \mathcal L(\boldsymbol w)\) — the gradient of the function (its meaning is shown in (1))

  • \(Q(\boldsymbol w)\) — the function we aim to minimize with respect to \(\boldsymbol w\):

\[ Q(\boldsymbol w) = \sum\limits_{i=1}^\ell \mathcal L_i(\boldsymbol w) \to \min\limits_{\boldsymbol w} \]
  • \(\ell\) — the number of summands (functions)
    Choose \(\boldsymbol w^{(0)}\), an initial approximation of the parameter vector. Then each subsequent parameter vector is calculated as

(5)#\[\boldsymbol w = \boldsymbol w - \eta \sum\limits_{i=1}^\ell \nabla \mathcal L_i(\boldsymbol w)\]

where \(\eta\) is the gradient step, or learning rate (it controls how strongly the parameter vector is changed in the direction opposite to the gradient)

The stopping of the algorithm will be determined by the convergence of \(Q\) or \(\boldsymbol w\). The graph below represents the Gradient Descent algorithm (GD) for the function \(f(x) = x^2 - 5x + 5\) with \(\eta = 0.85\).
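The GD update above can be sketched in a few lines of Python. For this one-dimensional example the sum in (5) collapses to a single gradient; the starting point \(x_0 = 8\) and the iteration count are arbitrary choices for illustration:

```python
def grad(x):
    # derivative of f(x) = x^2 - 5x + 5
    return 2 * x - 5

def gradient_descent(x0, eta, num_iters):
    # repeatedly step against the gradient: x <- x - eta * f'(x)
    x = x0
    history = [x]
    for _ in range(num_iters):
        x = x - eta * grad(x)
        history.append(x)
    return history

history = gradient_descent(x0=8.0, eta=0.85, num_iters=50)
print(history[-1])  # approaches the minimum at x = 2.5
```

With \(\eta = 0.85\) the distance to the minimum is multiplied by \(1 - 2\eta = -0.7\) at every step, so the iterates oscillate around \(x = 2.5\) while shrinking toward it.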

Stochastic Gradient Descent (SGD)#

GD algorithm problem#

Considering the previous algorithm (5), we come to a problem. To determine a new approximation of the vector \(\boldsymbol w\), it is necessary to calculate the gradient of every summand, i.e. \( \sum\limits_{i=1}^\ell \nabla \mathcal L_i(\boldsymbol w)\). In Machine Learning (ML) applications, where \(\ell\) (the number of input data points) can be huge, calculating all the gradients takes a very long time and can significantly slow down the algorithm, since it is very resource-intensive. During one iteration, the GD algorithm sums \(\ell\) gradients of functions of the \(n\)-dimensional vector \(\boldsymbol w\). So, the cost of one iteration in terms of Big O notation is O(\(\ell n\)).

SGD derivation#

In order to improve the efficiency of the algorithm, at each step, instead of calculating the whole sum of gradients (\(\sum\limits_{i=1}^{\ell} \nabla \mathcal L_i(\boldsymbol w)\)), a single one (\(\nabla \mathcal L_k(\boldsymbol w)\)) can be used, i.e.

(6)#\[\boldsymbol w = \boldsymbol w - \eta \nabla \mathcal L_k(\boldsymbol w)\]

where \( 1\leqslant k \leqslant \ell\) is an index chosen (usually at random) at each iteration.

According to formula (6), during one iteration we calculate the gradient of one selected function \(\nabla \mathcal L_k(\boldsymbol w)\) with respect to the \(n\)-dimensional vector \(\boldsymbol w\). It means that the cost per iteration is O(\(n\)).

SGD Realization Pseudocode#

Algorithm (Stochastic Gradient Descent)

To solve the optimization problem

\[ Q(\boldsymbol w) = \sum\limits_{i=1}^\ell \mathcal L_i(\boldsymbol w) \to \min\limits_{\boldsymbol w} \]

do the following steps: choose a learning rate \(\eta \) (0 < \(\eta \) < 1) and a rate of forgetting \(\lambda\) (0 < \(\lambda\) < 1)

  1. Initialization of \(\boldsymbol w\) with some initial values (e.g., from \(\mathcal N(0, 1)\))

  2. Initial calculation of the function: \(Q (\boldsymbol w) := \frac 1\ell \sum\limits_{i=1}^\ell \mathcal L_i(\boldsymbol w)\)

  3. While \(Q\) doesn’t converge or \(\boldsymbol w\) doesn’t converge DO:

  • Choose a random index value \(k\), where \(1 \leqslant k \leqslant \ell\)

  • Recalculate vector: \(\boldsymbol w := \boldsymbol w - \eta \nabla \mathcal L_k(\boldsymbol w)\)

  • Recalculate function: \(Q := \lambda \mathcal L_k(\boldsymbol w) + (1-\lambda) Q\)

It is necessary to understand that the phrase "While \(Q\) doesn't converge or \(\boldsymbol w\) doesn't converge DO" in the SGD algorithm means that we keep recalculating \(Q\) and \(\boldsymbol w\) until at least one of them stabilizes, i.e. its value at the current step is approximately the same (almost unchanged) as at the previous one.
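The pseudocode above can be sketched in Python. This is a minimal illustration on hypothetical toy data (a noiseless linear model with per-example loss \(\mathcal L_i(\boldsymbol w) = (x_i^T \boldsymbol w - y_i)^2\)); the data, the tolerance, and the iteration cap are assumptions for the example, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: l examples, n features, L_i(w) = (x_i . w - y_i)^2
l, n = 200, 3
X = rng.normal(size=(l, n))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

def loss_i(w, i):
    return (X[i] @ w - y[i]) ** 2

def grad_i(w, i):
    return 2 * (X[i] @ w - y[i]) * X[i]

def sgd(eta=0.05, lam=0.1, tol=1e-10, max_iter=20000):
    w = rng.normal(size=n)                           # step 1: init from N(0, 1)
    Q = np.mean([loss_i(w, i) for i in range(l)])    # step 2: initial functional
    for _ in range(max_iter):                        # step 3: main loop
        k = rng.integers(l)                          # random index k
        w = w - eta * grad_i(w, k)                   # recalculate the vector
        Q_new = lam * loss_i(w, k) + (1 - lam) * Q   # recalculate the functional
        if abs(Q_new - Q) < tol:                     # Q has stabilized
            return w, Q_new
        Q = Q_new
    return w, Q

w, Q = sgd()
```

Note that \(Q\) here is an exponential moving average: the rate of forgetting \(\lambda\) decides how quickly old losses fade out of the estimate.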

The graph below represents how the SGD algorithm works with different \(\eta\) (learning rate) for the function \(f(x) = x^2\). It converges to the optimum (\(x = 0\)) when 0 < \(\eta\) < 1 and diverges when \(\eta > 1\). In the case \(\eta = 1\) it loops (oscillates between two points forever), and if \(\eta = 0\) it doesn't move at all.
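For \(f(x) = x^2\) this behavior is easy to verify by hand: the update \(x \leftarrow x - \eta \cdot 2x = (1 - 2\eta)x\) shrinks, flips, or blows up depending on the factor \(1 - 2\eta\). A tiny check (the starting point and iteration count are arbitrary):

```python
def step(x, eta):
    # one gradient step on f(x) = x^2, where f'(x) = 2x
    return x - eta * 2 * x          # equivalent to (1 - 2 * eta) * x

def run(eta, x0=1.0, iters=20):
    x = x0
    for _ in range(iters):
        x = step(x, eta)
    return x

print(run(0.4))   # |1 - 2*eta| = 0.2 < 1: shrinks toward the optimum x = 0
print(run(1.0))   # factor is -1: x flips sign forever (the "loop")
print(run(1.5))   # |1 - 2*eta| = 2 > 1: diverges
```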


SGD Exercise:#

Given:
\(\boldsymbol w = (w_1, w_2, w_3) ^ T = (1, -1, 0.5) ^ T\)
\(\eta = 0.1\)
\(\lambda = 0.8\)
\(\mathcal L(\boldsymbol w) = w_1^{2} + w_2^{2} + 0.5 w_3^{2}\)
What is the sum of all parameters of the \(\boldsymbol w\) in the next iteration?

display_quiz("#q_w1")

Mini Batch GD#

Mini batch gradient descent is actually a mixture of Batch Gradient Descent and SGD. It computes the gradient of the function over small random subsets of the elements (mini batches).

The uniqueness of Mini Batch GD compared to SGD is that instead of only 1 random index, a subset of indexes is used. Then we do the following:

  1. Consider a subset of indexes \(i_1 < i_2 < ... < i_B\), where \(1 < B < \ell\) and \(B\) is the batch size

  2. In Mini Batch GD, the function \(E(\boldsymbol w)\) is calculated over the subset of indexes i.e. \(E_{B}(\boldsymbol w) = \sum\limits_{j=1}^B \mathcal L_{i_j}(\boldsymbol w)\)

  3. The vector \(\boldsymbol w\) is calculated over the subset of indexes i.e. \(\boldsymbol w = \boldsymbol w - \eta \sum\limits_{j=1}^B \nabla \mathcal L_{i_j}(\boldsymbol w)\)

Considering the complexity of the Mini Batch GD algorithm in Big O notation, we see that during 1 iteration we perform the summation of \(B\) gradients of functions of the \(n\)-dimensional vector \(\boldsymbol w\). So, the cost per iteration is O(\(Bn\)).
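The three steps above can be sketched on hypothetical toy data (the quadratic per-example loss is the same assumption as before; dividing the batch gradient by \(B\) merely rescales \(\eta\) relative to the summed form in step 3):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: L_i(w) = (x_i . w - y_i)^2
l, n, B = 200, 3, 16
X = rng.normal(size=(l, n))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

def minibatch_gd(eta=0.05, num_iters=500):
    w = np.zeros(n)
    for _ in range(num_iters):
        idx = rng.choice(l, size=B, replace=False)   # random mini batch of B indexes
        residual = X[idx] @ w - y[idx]
        grad = 2 * X[idx].T @ residual / B           # averaged batch gradient
        w = w - eta * grad
    return w

w = minibatch_gd()
```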

Mini Batch GD VS SGD#

Compared to SGD, Mini Batch GD has fewer oscillations. On the other hand, it requires more memory and computational resources than SGD. The graph below represents the value of \(Q (\boldsymbol w)\) at each iteration step.

As can be seen, Mini Batch GD converges faster than SGD because averaging \(\mathcal L (\boldsymbol w)\) over a batch allows a more precise update of the parameters.

Heuristics#

Here, "heuristics" is understood as a set of techniques and methods that facilitate and simplify the solution of the corresponding tasks. Here is a list of heuristics suggested for working with SGD:

  • Initialization of the vector \(\boldsymbol w\)

  • Choice for learning rate \(\eta\)

Initialization of the vector \(\boldsymbol w\)#

One of the crucial steps of the SGD algorithm is the initialization of \(\boldsymbol w\). It is common to use the zero vector as the initial value, but there are also other approaches.

  1. So, \(\boldsymbol w = 0\)

Another way is to take small random values of \(w\), since large initial values can lead into an area with small-modulo derivatives, and it is difficult to get out of such an area.

  2. So, \(w_j\) = random(\(-\frac 1{2n}\), \(\frac 1{2n}\)) (the denominator is "\(2n\)" rather than "\(n\)" because in the case \(n\) = 1 the values could otherwise reach 1, which is inappropriately large; with \(2n\) they are at most 0.5 in absolute value.)
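Both initialization options can be sketched as follows (a small illustrative helper; the function name and interface are made up for this example):

```python
import numpy as np

def init_weights(n, scheme="small_random", rng=None):
    # Option 1: zero vector; option 2: small random values in (-1/(2n), 1/(2n))
    rng = np.random.default_rng() if rng is None else rng
    if scheme == "zeros":
        return np.zeros(n)
    return rng.uniform(-1 / (2 * n), 1 / (2 * n), size=n)

print(init_weights(4, "zeros"))   # [0. 0. 0. 0.]
print(init_weights(4))            # 4 values, each within (-0.125, 0.125)
```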

Dynamic Learning Rate \(\eta\)#

As an optimization, the learning rate \(\eta\) can be changed dynamically at each iteration: replace \(\eta\) with a time-dependent learning rate \(\eta(t)\). Here is a list of some commonly used formulas for \(\eta(t)\).

  1. \(\; \eta(t)=\frac 1{t}\), inverse (hyperbole) decay

  2. \(\; \eta(t)=\eta_o \cdot e^{-\lambda t}\), exponential decay

  3. \(\; \eta(t)=\eta_o \cdot (\beta t + 1)^{-\alpha}\), polynomial decay

Note

If \(\eta(t)\) decreases too quickly, we stop optimizing prematurely. If we decrease it too slowly, we waste too much time on optimization.
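The three schedules can be written directly (the constants \(\eta_0\), \(\lambda\), \(\alpha\), \(\beta\) below are arbitrary illustrative values, not prescribed by the text):

```python
import numpy as np

def inverse_decay(t):
    return 1.0 / t                                # eta(t) = 1/t

def exponential_decay(t, eta0=0.5, lam=0.1):
    return eta0 * np.exp(-lam * t)                # eta(t) = eta0 * e^(-lam*t)

def polynomial_decay(t, eta0=0.5, beta=1.0, alpha=0.5):
    return eta0 * (beta * t + 1) ** (-alpha)      # eta(t) = eta0 * (beta*t + 1)^(-alpha)

for t in (1, 10, 100):
    print(t, inverse_decay(t), exponential_decay(t), polynomial_decay(t))
```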

Constant learning rate VS Dynamic learning rate#

First, look at SGD with constant \(\eta\) = 0.1.

Advantages and Disadvantages#

As any algorithm, SGD has its own pros and cons.
Advantages:

  • Simple to implement

  • Suitable for problems with a huge amount of data (input points)

  • Requires less memory and computational resources than GD

Disadvantages:

  • It can easily get stuck in local optima

  • It fluctuates strongly

Optimizers of Gradient Algorithms#

The main disadvantage of gradient algorithms is getting stuck in local optima (minima).

https://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Extrema_example_original.svg/1200px-Extrema_example_original.svg.png

Moreover, at such points the derivative is zero, which means that further change of the parameter vector \(\boldsymbol w\) according to formula (6) ( \(\nabla Q(\boldsymbol w) = \nabla \mathcal L_k(\boldsymbol w) = 0\) ) will not happen (we will get stuck).

In addition, stochastic gradient algorithms can fluctuate strongly when moving toward the minimum point of the function.

Several solutions have been proposed to address these problems. See below.

Momentum Gradient Descent#

../../_images/momentum-meme.png

It was proposed to average the gradient in steps using the formula:

(7)#\[\boldsymbol v = \gamma \boldsymbol v + (1-\gamma)\eta_t \cdot \nabla Q(\boldsymbol w)\]

where \(\boldsymbol v\) accumulates the average of the gradients, and \(\gamma\) is a parameter that adjusts how many past gradients are taken into account in the value of \(\boldsymbol v\).

Note

if we rewrite \((1-\gamma)\eta_t\) from formula (7) as a new learning rate \(\eta_t := (1-\gamma)\cdot\eta_t\), we obtain the algorithm: \(\boldsymbol v = \gamma \boldsymbol v + \eta_t \cdot \nabla Q(\boldsymbol w)\)

Finally, just change the \(\boldsymbol w\) using the familiar principle:

(8)#\[\boldsymbol w = \boldsymbol w - \boldsymbol v\]

where \(\boldsymbol v\) is the vector of smoothed gradients.
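Formulas (7) and (8) can be sketched on a hypothetical toy objective (the ill-conditioned quadratic \(Q(\boldsymbol w) = w_1^2 + 10 w_2^2\) and all the constants below are assumptions for illustration):

```python
import numpy as np

def grad_Q(w):
    # gradient of the toy objective Q(w) = w1^2 + 10 * w2^2
    return np.array([2 * w[0], 20 * w[1]])

def momentum_gd(w0, eta=0.5, gamma=0.9, num_iters=200):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(num_iters):
        v = gamma * v + (1 - gamma) * eta * grad_Q(w)   # formula (7): smoothed gradient
        w = w - v                                       # formula (8)
    return w

w = momentum_gd([3.0, 3.0])
print(w)  # both coordinates end up close to the optimum (0, 0)
```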

NAG - Nesterov’s accelerated gradient#

Taking into account the already mentioned formula in momentum:

\[\boldsymbol v = \gamma \boldsymbol v + \eta_t \cdot \nabla Q(\boldsymbol w)\]

where \(\eta_t := (1-\gamma)\cdot\eta_t\) for simplicity.

Now, we can represent the “\(\boldsymbol v\)” as the sum of the vectors “\(\gamma \boldsymbol v\)” and “\(\eta_t \cdot \nabla Q(\boldsymbol w)\)”.

../../_images/sgd-momentum.png

But at the time of calculating the gradient, we already know that we will shift by the value of “\(\gamma \boldsymbol v\)”. Therefore, it would be more correct to calculate the gradient not at point \(\boldsymbol w\), but at point (\(\boldsymbol w - \gamma \boldsymbol v\)) or:

../../_images/sgd-nesterov.png

The detailed step-by-step visualization of the NAG algorithm applied to the quadratic function (\(f(x) = x^2\)) is shown below. The look-ahead step is also shown.

../../_images/nesterov-detailed.png

Mathematically NAG can be expressed as:

(9)#\[\boldsymbol v = \gamma \boldsymbol v + \eta_t \cdot \nabla Q(\boldsymbol w-\gamma \boldsymbol v)\]

Such an approach, as a rule, shows better convergence to the optimum point.
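A sketch of formula (9) on the same kind of hypothetical toy objective (the quadratic and all constants are arbitrary illustrative choices; here \(\eta_t\) already absorbs the factor \(1-\gamma\)):

```python
import numpy as np

def grad_Q(w):
    # gradient of the toy objective Q(w) = w1^2 + 10 * w2^2
    return np.array([2 * w[0], 20 * w[1]])

def nesterov_gd(w0, eta=0.05, gamma=0.9, num_iters=200):
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(num_iters):
        lookahead = w - gamma * v                  # evaluate the gradient at the shifted point
        v = gamma * v + eta * grad_Q(lookahead)    # formula (9)
        w = w - v
    return w

w = nesterov_gd([3.0, 3.0])
```

The only change relative to momentum is where the gradient is evaluated: at \(\boldsymbol w - \gamma \boldsymbol v\) instead of \(\boldsymbol w\).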

SGD VS with Momentum VS with Nesterov Momentum#

Visualization for \(f(x) = x^2sin x\)#

../../_images/3595144764983580fb0444a3648a237adc1a6e1347509f263e5604dac5a4fc1c.png

Visualization for \(f(x) = x^2\)#

../../_images/1ef0900fc153d0a9ba8742fb20e2e55c2aa2a7faceea1e663b0269eec6d2f91b.png
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Define the objective function (multidimensional)
def objective_function(x, y):
    return x**2 + y**2 + 0.3 * x * y

# Gradient of the objective function
def gradient(x, y):
    return np.array([2*x + 0.3*y, 2*y + 0.3*x])

# Stochastic Gradient Descent algorithm with noise and convergence threshold
def stochastic_gradient_descent(initial_point, learning_rate, num_iterations, noise_scale=0.05, convergence_threshold=0.01):
    points = [initial_point]
    for _ in range(num_iterations):
        grad = gradient(*points[-1])
        noise = noise_scale * np.random.randn(2)  # Adding Gaussian noise
        update = -learning_rate * (grad + noise)
        new_point = points[-1] + update
        points.append(new_point)
        
        # Check convergence
        if np.linalg.norm(update) < convergence_threshold:
            break
    
    return np.array(points)

# Initial parameters
initial_point = np.array([3, 3])
learning_rate = 0.1
num_iterations = 50

# Create a contour plot of the objective function
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = objective_function(X, Y)

# Contour plot of the objective; all contour options grouped in one dict
contour = go.Contour(x=x, y=y, z=Z, contours=dict(showlabels=True, labelfont=dict(size=12), coloring='lines'), colorscale='viridis')
trajectory_trace = go.Scatter(x=[], y=[], mode='lines+markers', line=dict(color='red', dash='dash'), marker=dict(color='red'), name='Trajectory')

# Update the frames for animation
frames = []
trajectory = stochastic_gradient_descent(initial_point, learning_rate, num_iterations, noise_scale=3.5, convergence_threshold=0.1)
for i in range(len(trajectory)):
    frame = go.Frame(
        data=[go.Scatter(x=trajectory[:i+1, 0], y=trajectory[:i+1, 1], mode='lines+markers', line=dict(color='red', dash='dash'))],
        name=f'SGD Step {i+1}',
        traces=[1]
    )
    frames.append(frame)
    
# Create subplot with 3D scatter plot
fig = go.Figure(
        data=[
            go.Scatter(x=[initial_point[0]], y=[initial_point[1]], mode='markers', marker=dict(color='red'), name='Initial Point'),
            trajectory_trace,
            contour
        ],
        layout=go.Layout(
            title='Stochastic Gradient Descent Simulation', 
            showlegend=False,
            updatemenus=[dict(
                type='buttons',
                showactive=False,
                buttons=[dict(
                    label="Play",
                    method="animate",
                    # animation options: "mode" and "fromcurrent" are top-level,
                    # "redraw" belongs inside the "frame" settings
                    args=[None, {
                        "fromcurrent": True,
                        "mode": "immediate",
                        "transition": {"duration": 0},
                        "frame": {"redraw": False},
                    }],
                )],
            )],
        ),
        frames=frames
    )

# Show the plot
fig.show()